Conversation
solsson
left a comment
Requesting changes (I created the PR so I can't reject it formally).
Only the core changes are reviewed here. I will look at the benchmark setup for long-term storage later.
- OpenMetricsText0.0.1
- PrometheusProto
- PrometheusText1.0.0
- PrometheusText0.0.4
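For reference, the values above are identifiers for Prometheus's `scrape_protocols` setting. A minimal sketch that keeps only the modern formats (the identifiers are real Prometheus values; dropping the legacy ones is this review's suggestion, not the current config):

```yaml
# prometheus.yml fragment: restrict content negotiation to modern
# exposition formats, omitting the legacy text versions.
global:
  scrape_protocols:
    - OpenMetricsText1.0.0
    - PrometheusProto
```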
Do we need all of these? I'd prefer we avoid legacy versions.
expr: >-
  sum(instance_cpu:node_cpu_top:rate5m) without (mode, cpu)
  /
  sum(rate(node_cpu_seconds_total[5m])) without (mode, cpu)
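As a sketch, the expression above would sit inside a rule group in the ConfigMap-mounted rules file; the group and record names below are assumptions, only the expr comes from the diff:

```yaml
# Hypothetical rules-file fragment; group and record names are assumed,
# the expression is the one quoted in the diff above.
groups:
  - name: node-cpu
    rules:
      - record: instance:node_cpu_utilisation:ratio
        expr: >-
          sum(instance_cpu:node_cpu_top:rate5m) without (mode, cpu)
          /
          sum(rate(node_cpu_seconds_total[5m])) without (mode, cpu)
```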
What's the community source for these rules?
metric_relabel_configs:
  - source_labels: [__name__]
    regex: kube_replicaset_status_observed_generation
    action: drop
We must do service discovery using conventions and labels. Make sure ystack uses port names and current community standards for Prometheus discovery, then update the SD config so that it has no hard-coded targets. I'm fine with more than one SD config as long as it's clear how a pod in any namespace can match it. ServiceMonitor is also sometimes a use case, so we need an example of that in ystack.
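A minimal sketch of convention-based discovery using `kubernetes_sd_configs` with the widely used `prometheus.io/*` pod annotations; the job name and final label choices are assumptions, the `__meta_kubernetes_*` labels are standard Prometheus meta labels:

```yaml
# Sketch: any pod in any namespace can opt in by setting the
# prometheus.io/scrape annotation; no hard-coded targets.
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods that explicitly opt in.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Optional override of the metrics path.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Optional override of the scrape port.
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

A second SD config keyed on container port names (e.g. a `metrics` port) could coexist with this one, as the comment above allows.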
Remove all monitoring.coreos.com CRDs (Prometheus, Alertmanager, ServiceMonitor, PodMonitor, PrometheusRule) and replace them with plain Kubernetes Deployments, ConfigMaps, and scrape config. Prometheus now uses kubernetes_sd_configs for target discovery instead of operator-managed ServiceMonitor/PodMonitor CRDs. Recording rules move from a PrometheusRule CRD to a ConfigMap-mounted rules file. A configmap-reload sidecar triggers /-/reload on changes.
Consolidates k3s/30-monitoring-operator + k3s/31-monitoring into a single k3s/30-monitoring base. Updates the converge and validate scripts accordingly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
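A sketch of what the reload sidecar might look like in the Prometheus Deployment's container list; the image tag, port, and mount names are assumptions:

```yaml
# Hypothetical sidecar: watches the mounted ConfigMap volume and POSTs
# to Prometheus's /-/reload endpoint when the files change.
- name: configmap-reload
  image: ghcr.io/jimmidyson/configmap-reload:v0.12.0   # assumed image/tag
  args:
    - --volume-dir=/etc/prometheus
    - --webhook-url=http://localhost:9090/-/reload
  volumeMounts:
    - name: config
      mountPath: /etc/prometheus
      readOnly: true
```

Note that `/-/reload` only works when Prometheus runs with `--web.enable-lifecycle`.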
v0.31.0 was not available in the container registry at experiment time. Revert this commit to restore v0.31.0 once it is published. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deploy Thanos Receive (StatefulSet) + Query (Deployment) and GreptimeDB standalone as competing remote_write backends for the metrics-v2 experiment. Prometheus sends scraped metrics to both via remote_write for side-by-side comparison. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
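The fan-out described above could look like the following in prometheus.yml; the Service names, ports, and GreptimeDB database name are assumptions (19291 is Thanos Receive's default remote-write port, 4000 is GreptimeDB's default HTTP port):

```yaml
# Sketch: one remote_write entry per competing backend, so both
# receive the same scraped samples for side-by-side comparison.
remote_write:
  - url: http://thanos-receive:19291/api/v1/receive
  - url: http://greptimedb:4000/v1/prometheus/write?db=public
```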
Thanos wins 8.35 vs 8.00 over GreptimeDB on weighted criteria: query correctness, operational complexity, resource usage, maturity, and storage cost projection. All PromQL queries returned consistent results across all three backends. Documents deviations from the original experiment plan. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both backends now write to versitygw object storage for storage cost comparison. Adds bucket-create jobs and S3 configuration for each. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
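On the Thanos side, the S3 wiring is typically an objstore config like the sketch below; the bucket name, endpoint, port, and credential placeholders are assumptions:

```yaml
# Hypothetical objstore.yml for Thanos pointing at in-cluster versitygw;
# credentials elided, endpoint/bucket assumed.
type: S3
config:
  bucket: thanos-metrics
  endpoint: versitygw:7070
  insecure: true
  access_key: <access-key>
  secret_key: <secret-key>
```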
WARNING comment included: these overrides should not be used in production. Forces frequent block cuts so S3 uploads are visible quickly during the metrics-v2 experiment. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
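A sketch of the experiment-only override on the Thanos Receive container; the block-duration flags are an assumption (they are hidden/testing flags in Thanos) and, per the warning, not for production:

```yaml
# Hypothetical args override: cut TSDB blocks frequently so uploads to
# S3 become visible within minutes instead of hours. Experiment only.
args:
  - receive
  - --tsdb.min-block-duration=15m
  - --tsdb.max-block-duration=15m
```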
Both backends now write to versitygw. GreptimeDB's columnar format produces 5.6x less data (252 KB vs 1.4 MB) for the same metrics workload. This flips the storage cost score and brings the weighted totals to a near-tie (Thanos 8.05 vs GreptimeDB 8.30). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 27a6b73 to 7e3d067.
for provisioners that use a fixed IP. Use y-k8s-ingress-hosts -check before attempting -write, so provision can complete without a TTY or sudo when entries already exist. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from 7191acd to 77af594.
We can most likely meet our metrics-endpoint discovery needs using conventions and kubernetes_sd_configs.
While we're at it we should revisit long-term storage and querying.